CDS DS 453 / 653 — Crypto for Data Science¶

Syllabus¶

Course Overview¶

Instructors

  • Professor: Mayank Varia (varia@bu.edu)
  • TA: Shreyas Sudarsan (shreyas9@bu.edu)

Meetings

  • Lecture: Tuesdays and Thursdays at 9:30-10:45am (EPC 204)
  • Discussion: Mondays at 9:05-9:55am or 11:15am-12:05pm (CCDS 164)
  • Office hours: Posted on the Piazza Staff page, currently Mon 3-5pm for Shreyas and Tues 2:30-4pm for Mayank

Websites

  • Piazza: https://piazza.com/bu/spring2024/ds453653/
  • Gradescope: https://www.gradescope.com/courses/710247 (entry code B2Y7G5)
  • Textbooks: all are available online for free, with links on the Piazza Resources page

Quiz dates (mark on your calendars now!)

  • Monday, February 5
  • Wednesday, February 21 (this day follows a Monday schedule)
  • Monday, March 25
  • Tuesday, April 16
  • The final exam period (tentatively scheduled for May 6 at 9-11am)

Prerequisites

  • Mathematical maturity with linear algebra and probability concepts at the level of DS 121 and 122, or equivalent
  • Familiarity with algorithms at the level of DS 320, CS 330, or equivalent

Please read the syllabus carefully. You are responsible for adhering to course policies at all times, especially the academic code of conduct, plagiarism, AI, and collaboration policies.

The syllabus can be downloaded from a link at the top of the Piazza course information page. These lecture notes contain an abbreviated version of the syllabus, and should not be considered as a substitute for the syllabus itself.

Course Information¶

Course Description¶

CDS DS 453 investigates techniques for performing trustworthy data analyses without a trusted party, and for conducting data science without data sharing.

The first half of the course investigates cryptocurrencies, the blockchain technology underpinning them, and the incentives for each participant. Students will learn how to create transactions, develop smart contracts, and participate in decentralized exchanges. Then, we take a deeper dive into consensus mechanisms, historical and modern, that maintain stability if a certain fraction of the participants or computing power behaves honestly.

The second half of the course focuses on privacy and anonymity using advanced tools from cryptography. We study zero knowledge proofs and their role in preventing re-identification attacks and increasing scalability of blockchains. We also study secure multiparty computation and its role in designing private contracts and atomic swaps. The course concludes with a broader exploration into the power of conducting data science without being able to see the underlying data.

Within the undergraduate Data Sciences major, this course satisfies the DS methodology elective in the “scalable & trustworthy DS” category.

In spring 2024, this course does not satisfy any Hub units.

Learning Objectives¶

There are five parts to this course, each with their own learning objectives.

  1. Authenticity without interaction: using hash functions and digital signatures to authenticate people and data over the Internet
  2. Currencies without centralization: exploring the mechanics of cryptocurrencies, blockchains, and mining
  3. Consensus without trust: reaching agreement over the Internet using distributed algorithms
  4. Data analysis without data sharing: using crypto to work with data you cannot see, or to prove compliance with a stated rule, policy, or regulation
  5. Social, policy, legal, and regulatory impacts: discussing (in)appropriate uses of the tech we learn in this course

The full list of 14 learning objectives is available in the syllabus; roughly, they correspond to the weeks of the course.

The full course schedule is available on the Piazza Course Schedule page. It contains the required textbook reading, lecture notes/video, and assignments due each week. The schedule might change slightly over time, and I welcome suggestions from you about topics that you want to see covered in this course.

Prerequisites¶

No prior background in cryptography is necessary.

This course requires mathematical maturity with linear algebra and probability concepts at the level of DS 122 or equivalent.

It also requires familiarity with algorithms and big-O notation at the level of DS 320, CS 330, or equivalent.

If you do not meet the prerequisites of this course, please talk with me after today's lecture.

Course websites¶

There are 2 websites for this course, Piazza and Gradescope.

  • The course schedule and homework assignments will be posted on Piazza at https://piazza.com/bu/spring2024/ds453653. You are responsible for any announcements on Piazza, so please register using an email address that you check regularly. (Also, only email the instructors for a personal or administrative manner; all scientific questions must be sent via Piazza.)

  • Completed assignments must be submitted on Gradescope; this is the only way to receive credit. Please register at https://www.gradescope.com/courses/710247 using entry code B2Y7G5.

Course Assessment¶

A typical week of this course contains:

  • A discussion section or quiz on a Monday,
  • Lectures on Tuesday and Thursday, and
  • Assignments due on Thursday evening at 8pm (which is also when we release the next week's assignments).

There are four types of graded assignments. The grading system is likely different than what you're used to in most courses. The main rules-of-thumb to remember for everything in this course is:

  • Everything is graded Pass / Not Passed (there is no partial credit)
  • You must earn 4 out of 5 points to pass each assignment
  • You have multiple opportunities to complete each assignment -- but there are clearly-stated limits to the number of chances that you have.

1. Reading Assessments (13 in total)¶

You must read a portion of a textbook every week. The required reading is listed on the Piazza Course Schedule page.

To assess your understanding of the material, we will post 5 multiple-choice or short-answer questions on Gradescope each week.

  • You must correctly answer 4 of the 5 questions to receive credit for the weekly reading assessment.
  • The assessment is auto-graded, and you can submit as many times as you wish before the due date.

2. Programming Assignments (9 in total)¶

Programming assignments are due in most weeks of the course, but there are intentional gaps around the times of quizzes and project due dates to allow you time to work on those. Each assignment contains questions collectively worth 5 points.

  • You must receive 4 of the 5 available points to pass the assignment. This is a hard cutoff: there is no partial credit awarded for receiving fewer than 4 points.
  • The assessment is auto-graded, and you can submit as many times as you wish before the due date.

3. Quizzes (4 in total)¶

There are 4 quizzes in this course, one for each of Parts 1 through 4 of the course. (There is no quiz for Part 5; it is evaluated instead in the assessments.)

Quizzes occur either during the discussion or lecture sections; the exact quiz dates are stated at the top of the syllabus and posted on the Piazza Course Schedule page.

Each quiz contains 5 questions. Your responses are graded on correctness and clarity. You will receive credit for a response that meets the stated expectations of the question, sufficiently explains and demonstrates understanding of the concepts, and does not contain any significant gaps, errors, or omissions.

  • You must receive credit on 4 of the 5 questions to pass the quiz. If you get all 5 questions correct, you will receive a grade of High Pass.
  • You can attempt each quiz 3 times: on the initial quiz date, on the make-up quiz date, and during the final exam period (if needed). Only the best score counts (except in cases of academic misconduct, which result in a grade of Not Passed.)
  • Also once in the semester: if you get 3 questions right on a quiz, you can convert it to a Pass grade by submitting a written report that provides the correct answers and explains why your initial quiz responses were incorrect.

4. Projects (2 in total)¶

There are two open-ended projects in this course.

  • Data science investigation of a public blockchain, with submission dates on March 7 and 21.
  • Using cryptographically secure computation and zero knowledge proofs, with submission dates on April 25 and May 1.

DS 653 students must complete both projects. DS 453 students only need to complete one project of their choice.

We will provide more details about the projects later in the semester. You will have 3-4 weeks to work on each project. As with everything else in this course:

  • Your final grade for the project is either Pass (P) or Not passed (N).
  • You will have two opportunities to submit your project work. In the first submission, I will either tell you that you have passed the project, or I will provide specific revision requirements that you must complete in the second submission to pass the project.

Course Grading¶

Your base grade in this course is based on passing a sufficient number of reading assessments, programming assignments, quizzes, and projects.

For DS 453 students: to receive a course grade of A, B, C, or D, you must complete all stated requirements below. In other words, the sentences below have four clauses, and you need to satisfy the conjunction of their requirements.

  • A: Pass 13 reading assessments, 9 programming assignments, 4 quizzes (with at least 2 High Pass scores), and 1 project.
  • B: Pass 11 reading assessments, 7 programming assignments, 4 quizzes, and 1 project.
  • C: Pass 9 reading assessments, 6 programming assignments, 3 quizzes, and 1 project.
  • D: Pass 7 reading assessments, 4 programming assignments, and 2 quizzes.
  • F: (Not meeting all requirements for a D.)

For DS 653 students: to obtain a base grade of A or B, you must pass 2 projects.

This base grade can be increased as follows:

  • If you also satisfy at least 1 of the 4 requirements of the next-higher grade level, then you get a + modifier. For example: if you sastisfy all B requirements and pass all 13 reading assessments, then you will earn a grade of B+.
  • If you also satisfy 3 of the 4 requirements of the next-higher grade level, then you get another step up. For example: if you satisfy the B level for everything and also the A level for reading, quizzes, and projects, then you will earn a grade of A-.

At any point in the course: you should know where you stand, and what additional work you need to perform in order to reach each grade level. The syllabus contains a sheet that you can use to track your progress in the course.

Course Policies¶

You are responsible for adhering to course policies at all times! If you have questions about the policies, feel free to ask me at any time -- for instance, before/after lecture, at office hours, or on Piazza.

Academic code of conduct¶

You must read and adhere to BU’s Academic Code of Conduct, which is available here: https://www.bu.edu/academics/policies/academic-conduct-code/. All work in this course is subject to BU’s Academic Code, so please familiarize yourself with this code, its definitions of misconduct and plagiarism, and its sanctions.

Plagiarism¶

All written work in this course must be original to you. If you consult outside texts, or other forms of assistance, cite these sources in the proper format—at a minimum, include the author, title, and website link for all external sources (books, journals, lectures, web sites, AI). We are required to report all suspected cases of plagiarism to the Academic Dean for review.

Academic integrity in computing coursework has some special aspects. Please review the examples of plagiarism as provided by the BU Computer Science department: https://www.bu.edu/cs/undergraduate/undergraduate-life/academic-integrity/.

Generative AI policy¶

All submitted work in this course must conform to the CDS Generative AI Assistance Policy, which you can read at https://www.bu.edu/cds-faculty/culture-community/gaia-policy/. Also, keep in mind that AI tools are often wrong!

Collaboration policy¶

The goal of homework assignments and projects is to learn. Hence, I encourage you to use any and all resources that can help you to learn the material: computers/calculators, Piazza, lecture notes, textbooks, other websites, and your fellow classmates. That said,

  • You cannot copy solutions from anyone else, or give your solutions to a classmate to copy; this is plagiarism.
  • You also cannot actively search for the solutions to the homework questions on the Internet or in any other source; this is misconduct.
  • Finally, your submission must list all people and resources that you used (discussed next).

By contrast, the goal of the quizzes is for you to show me what you have learned. So, any form of collaboration is strictly prohibited. Also, written notes and electronic aids (e.g., computers or calculators) are all forbidden from use during quizzes. (That said, I encourage you to collaborate with classmates when studying lecture materials and preparing for the quizzes.)

Violations of the conduct code and collaboration policy will result in receiving a score of N (not passed) on the assignment or quiz without the possibility to improve the grade, and may be grounds for referral to BU’s Academic Conduct Committee.

Documenting collaborators, sources, and AI tools¶

If you have any questions or concerns about the conduct code and collaboration policy, then I recommend that you ask me (in person or via private Piazza note) before taking an action that might be a violation. Additionally, at the bottom of each assignment and project, you must list:

  1. names of all classmates you worked with,
  2. all websites you used besides the ones listed in the lecture notes or textbooks, and
  3. all code that you used from other sources, including the exact prompts and responses from any Generative AI tool.

This is your opportunity to document and explain any collaborators or sources used and why you believe it adheres to the code of conduct and collaboration policy.

If I discover a violation of the policy that you have documented, then I will not refer the case to BU’s Academic Conduct Committee; the worst possible recourse is that I will ask you to redo the assignment (or create a new, similar assignment for you to do in its place). If I discover an undocumented violation after the fact, then this will be considered academic misconduct.

Accommodations¶

BU strives to be accessible, inclusive, and diverse in our facilities, programming, and academic offerings. Your experience in this course is important to the teaching staff.

If you have a disability or believe that you require a reasonable accommodation, please meet with BU Disability and Access Services as soon as possible at the beginning of the semester. Their office is at 25 Buick Street, Suite 300, and they can be contacted at 617-353-3658. Requests for accommodations are sent by that office to the Academic Dean who approves and returns them. Disability and Access Services then forwards them to the instructor.

If you have an accommodation letter, please send it (and only it) to me via email early in the semester, before tasks become due.

Learning environment¶

This course seeks to be inclusive of people of all genders, races, cultures, abilities, and sexual orientations. Please respect your fellow classmates and contribute toward a positive learning environment for everyone.

While I actively encourage discussion and debate on ideas, I won’t tolerate criticism of other people. Also, while you can use a computer for note-taking, you may not use a laptop or cellular phone in class for web surfing, sending messages, or anything else that can cause a distraction to your fellow classmates. Anyone who causes a disruption to the learning environment will be asked to leave.

Absence policy¶

This course follows BU’s policy on absences for religious observance. Otherwise, students should attend the lectures and discussion labs, either in person or virtually via Zoom.

Due to ongoing public health concerns: if you feel sick, please err on the side of caution and attend via Zoom or review the lecture notes and video afterward on Piazza.

Late work policy¶

You are responsible for submitting reading assessments, programming assignments, and projects electronically on Gradescope by the stated due date and time (typically, Thursday at 8pm).

Up to three times in the semester, you can submit a reading assessment, programming assignment, or project up to 3 days late (e.g., by Sunday at 8pm).

Any subsequent late assignments are not accepted (and any such auto-graded scores will be manually discarded after the fact), except in cases of long-term emergencies (e.g., family, medical); if this applies to you, inform me as soon as you are able.

Quiz absence policy¶

If you have a valid conflict with a quiz, you must send me a written note as soon as you are aware, and with a minimum of 2 weeks notice (barring extenuating circumstances, e.g., illness).

  • If you have an excused absence from a quiz, then you retain the right to have 3 opportunities to pass that quiz; the first opportunity is during the scheduled make-up quiz date, and we can schedule another make-up opportunity afterward.
  • If you have an unexcused absence from a quiz or a make-up quiz, then you forfeit that quiz-taking opportunity but can still attend the other opportunities to take that quiz; in other words, you would have two chances to pass that quiz rather than three.

The final exam period (i.e., the time of the final chance to take each quiz) can only be rescheduled in accordance with the university policy: https://www.bu.edu/reg/calendars/final-exams/policy/.

Regrade policy¶

You have the right to request a regrade of any project or quiz question. All regrade requests must be submitted via Gradescope, and must describe a factual error in our assessment. If you request a regrade for one question on a quiz or one part of a project, then we have the right to review the entire quiz or project. Beware that this may potentially result in a lower grade.

The reading assessments and programming assignments are auto-graded, so in general we do not anticipate regrades. That said, if you discover an error in our auto-grading system, please send us a Piazza note explaining the issue and we will look into it.

Lecture 1: Introduction to Crypto for Data Science¶

“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”

– Ben Adida

1.1 Introduction to Crypto(graphy)¶

This is a course about crypto... which is admittedly a term with many different meanings to different people. It's also an abbreviation with (at least) three different endings.

  • Cryptocurrencies
  • Cryptography
  • Cryptology

We will get to the topic of cryptocurrencies in a few minutes. But first, here is a question to start us off in this course: what is cryptology?

Cryptology comes from two Greek word parts:

  • kryptos, meaning secret or hidden
  • logy, meaning the study of

So the goal of cryptology is to study the art of keeping secrets.

In its modern form, crypto uses hard math problems as a way to encode data in order to restrict who can use it, how they can use it, and sometimes even when and where they can access it.

Cryptology can be divided into two sub-fields

  • Cryptography: the art of making codes.
  • Cryptanalysis: the art of breaking codes.

Both cryptography and cryptanalysis are challenging, multi-disciplinary tasks. They also reinforce each other, due to Schneier's law.

"Anyone, from the most clueless amateur to the best cryptographer, can create an algorithm that he himself can't break."

– Bruce Schneier

In this course we will almost exclusively focus on the defensive, cryptographic side. Even though I won't say much about cryptanalysis in this course, it is important to know that all of the building blocks I will show you in today's lecture have been vetted through decades of cryptanalysts, collectively spanning many centuries or millenia of people-years. No single one of us can compete with that.

"Don't roll your own crypto."

– (Every cryptographer)

Crypto is a scientific field at the intersection of many disciplines. To accomplish their goals, cryptographers use:

  • Algorithms techniques in their protocol designs.
  • Engineering techniques to produce vetted, secure software implementations of their protocols.
  • Complexity theory to demonstrate security reductions from new protocols to more well-studied ones.
  • Mathematics for cryptanalysis.

This class will primarily make use of the first two skills: algorithms and software engineering.

While we won't delve into the specific type of math used in cryptanalysis, nevertheless we will often use concepts from algebra and probability.

Crypto and data science¶

This is also a course about how crypto can be used within data science.

It's likely that you have already used crypto in all of your data science projects, perhaps without explicitly thinking about it.

Lock icon on kaggle.com

In this image, the lock icon in the web browser means that my computer has a protected connection to the kaggle.com servers.

At a high level, this means that nobody else on the internet can

  • see which dataset I am downloading from kaggle.com; in other words, the data remains confidential, and
  • tamper with the dataset in transit; in other words, data integrity is maintained.

(Note though that the kaggle.com servers can see what you are doing on their website. We will see later in the semester how even this can potentially be avoided.)

Using crypto is an essential part of providing security over the Internet, because it has always been designed as an open network. There is no wire that directly connects your computer with kaggle.com. Instead, the data flows through several other computers along the way.

Here is a picture of all of the computer servers connected to the Internet in 1968.

The Internet in 1968

Source: pwnallthethings on Twitter

Of course, the Internet is much larger now.

Growth of the Internet over time

Source: Wikipedia

But the design of the Internet has fundamentally remained the same.

“The Internet is just the world passing notes in a classroom.”

– Jon Stewart

Getting the crypto right is essential here.

  • If the system is vulnerable to decoding or tampering by unauthorized parties, then it can lead to universal, covert breaches of security. Moreover, everyone will have a false sense of security.

  • If the system is too onerous or slow to use, then people will ignore it and move to other, less secure options.

Goals of cryptography¶

We will see throughout this course that crypto can be used in many ways besides network transmissions on the Internet.

Still though, this example highlights a general theme of all of our applications:

crypto is a social science that masquerades as a mathematical science.

Our goal is to use the tools from math, computer science, and data science in order to design systems that help people.

At a high level, cryptographic systems attempt to provide (some or all of) the following three objectives.

  • C: Confidentiality, meaning to keep data or people's identities hidden from others.
  • I: Integrity, which means preventing raw data or the data science results from being altered, or at least allowing for detection of tampering.
  • A: Availability, or ensuring that data science is censorship-resistant.

These three security objectives are sometimes referred to as the "CIA triad."

Admittedly, these three objectives are both underspecified (lacking mathematical rigor) and overloaded (the terms have many different sub-meanings).

There are many different flavors of confidentiality, integrity, and availability that cryptography can provide... in fact, far more than we have time to cover in this course.

Types of security guarantees

Source: Handbook of Applied Cryptography, page 4

(Note: BU courses CS 538 and CS 558 also cover different aspects of cryptography and its application in modern digital systems. This course has a partial overlap with them, but we will cover many topics that they don't and vice-versa. You are welcome to register for those courses as well, but they are not required knowledge.)

Crypto and policy¶

Because crypto is a scientific field that has social impacts, it is probably inevitable that crypto has clashed with law and policy over the past five decades.

  • On the one hand, crypto offers ways to upend existing norms, laws, and rules... in both good and bad ways.
  • On the other hand, the use of crypto is influenced and regulated by the law and politicians.

"Cryptography rearranges power: it configures who can do what, from what. This makes cryptography an inherently political tool, and it confers on the field an intrinsically moral dimension."

– Phillip Rogaway

We will see aspects of the two-way connection between crypto and policy throughout the semester, and we will spend the last few weeks focusing exclusively on it.

1.2 Introduction to Crypto(currencies)¶

[Note: This section of the lecture is based on Johns Hopkins CS 601.641 by Prof. Abishek Jain.]

Pausing the cryptography discussion for a moment, let's provide an overview of cryptocurrencies like Bitcoin and Ethereum. We will spend the next few weeks delving into their design.

From the start, it is worthwhile to state what this course is NOT about.

⚠ This is not a finance course on cryptocurrencies. You should not expect to be taught how to invest in cryptocurrencies or how to become a billionaire overnight.

In fact, the cryptocurrency market is still in its infancy and is incredibly volatile. Investing in cryptocurrencies is a good way to lose money. So this course doesn't offer any investment advice, except in this sentence: I advise you not to invest in cryptocurrencies.

Still though, cryptocurrencies are a worthy topic of study in order to understand their capabilities, potential, and limitations.

In the first unit of this course, we will explore the following topics about cryptocurrencies.

  • Understanding the mechanics of blockchains
  • Understanding why current implementations work
  • Understanding the necessary cryptographic background
  • Exploring applications of blockchains to cryptocurrencies and beyond
  • Understanding limitations of current blockchains

Let's start at the top of the list. What is a blockchain?

It is:

  • A distributed ledger or database
  • Used for building decentralized cryptocurrencies such as Bitcoin
  • Capable of improving many other applications such as distributed Domain Name system (DNS), Public-Key Infrastructure (PKI), stock trade databases, etc.
  • The basis of lots of exciting academic research and industry startups

Transferring digital money¶

Blockchains are a data structure to keep track of transactions.

Here is the motivating scenario for this unit: there are two people, who (following the typical convention in crypto) we will call Alice and Bob.

Alice would like to transfer $1 to Bob.

Let's think about

  • the functionality of how this could work, i.e., how it can work if everyone is honest
  • the security concerns, i.e., ways this could go wrong if someone is malicious
Alice Bob
Alice $\longrightarrow$ Bob

Let's start by exploring how money transfers work in the world of physical banking. Suppose Alice writes a check to Bob. What happens?

Physical check

If everyone is honest, the money is transferred because a bank performs the bookkeeping task of recording every account and its balance.

19th century ledger

Source: Wikipedia

Now let's think about what happens if each party is malicious. For this single transaction, Alice doesn't really care whether Bob is malicious or not; she isn't expecting to receive anything of value.

In the other direction: even if Bob trusts the bank (for now), he may be suspicious that Alice is trying to deceive him. He needs to be able to verify the following three properties.

  1. Alice properly signed the check: To accomplish this goal, Bob can verify the signature himself, say by comparing the check against Alice's identification card.

  2. Alice possesses $1 in her bank account: Bob can ask the bank to verify that Alice's account balance is sufficient for the check to clear. Note that Bob doesn't even need to learn Alice's balance, just the result of the binary predicate.

  3. Alice does not double spend the money by writing a check to someone else at the same time: This is the toughest property to achieve, because neither Bob nor the bank know yet whether Alice has overdrafted her account. They will only discover this later; if so, Alice and Bob can use the bank and the legal system to settle their dispute.

By contrast, the digital world is diverse and international. There is no single entity that everyone in the world trusts, nor is there a single legal system to identify and prosecute thieves and money launderers.

So, if we want a digital system of money, then we need a different approach altogether.

It will take us a few lectures to build up the tools for a digital system of transferring money. First, we need to understand two essential crypto tools: digital signatures and hash functions.

1.3 Introduction to Digital Signatures¶

Today, we only focus on property #1: building a digital version of signing a check.

  • The digital equivalent to "signing a check" is called, appropriately enough, a digital signature scheme.
  • The most common standard is called the digital signature algorithm, or DSA.

A digital signature allows Alice to append a signature $\sigma$ to a message $M$ that satisfies two properties.

  1. Correctness: Everyone in the world can verify whether Alice created the signature $\sigma$.
  2. Unforgeability: Only Alice can produce the signature $\sigma$, and it is bound to the message $M$. Nobody besides Alice can use $\sigma$ in order to forge a signature for any other message $M'$.

Also, there are (at least) two properties that digital signatures do not achieve on their own.

  • Receiver authentication: Because a signature can be verified by anyone, on its own it will not tell Bob that he was the intended target of the message. If Alice wants to convey that Bob is the intended target, she should do so in the message contents $M$.

  • Thwarting replay attacks: If Bob has a digital signature, he can show it to others (e.g., a bank) as many times as he wants. To solve this issue, we will have to be clever in thinking about the format of the message $M$ -- for instance, perhaps it should contain a unique serial number, so the bank can detect if the same check is being cashed twice.